An Online Analytical System for Multi-Tagged Document Collections

Author

  • Grzegorz Drzadzewski
Abstract

The New York Times Annotated Corpus and the ACM Digital Library are two prototypical examples of document collections in which each document is tagged with keywords and significant phrases. Such collections can be viewed as high-dimensional document cubes against which browsers and search systems can be applied in a manner similar to online analytical processing against data cubes. The tagging patterns in these collections are examined, and a generative tagging model is developed that can mimic the tag assignments observed in those collections. When a user browses the collection by means of a Boolean query over tags, the result is a subset of documents that can be summarized by a centroid derived from their document term vectors. A partial materialization strategy is developed to provide efficient storage of, and access to, centroids for such document subsets. A customized local term vocabulary storage approach is incorporated into the partial materialization to ensure that a rich and relevant term vocabulary is available for representing centroids while maintaining a low storage footprint. By adopting this strategy, summary measures that depend on centroids (including bursty terms or larger sets of indicative documents) can be efficiently and accurately computed for important subsets of documents. The proposed design is evaluated on the two collections, along with PubMed (a held-back document collection) and several synthetic collections, to validate that it outperforms alternative storage strategies. Finally, an enhanced faceted browsing system is developed to support users' exploration of large multi-tagged document collections. At each step of navigation, it summarizes the document result set through a set of indicative terms and a diverse set of documents, and it provides information scent that helps to guide users' exploration. These summaries are derived from pre-materialized views that allow for quick calculation of centroids for various result sets. The utility and efficiency of the system are demonstrated on the New York Times Annotated Corpus.
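
The centroid summary described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the toy collection `DOCS`, the helper names `select`, `centroid`, and `top_terms`, and the use of raw term counts as document term vectors are all assumptions made for illustration, and no materialization is attempted here.

```python
from collections import Counter

# Hypothetical toy collection: each document has a tag set and a term-count vector.
DOCS = [
    {"tags": {"politics", "europe"}, "terms": Counter({"election": 2, "vote": 1})},
    {"tags": {"politics", "economy"}, "terms": Counter({"election": 1, "budget": 3})},
    {"tags": {"sports"}, "terms": Counter({"match": 4})},
]


def select(docs, all_of=(), none_of=()):
    """Boolean tag query: keep documents with every tag in all_of and no tag in none_of."""
    required, excluded = set(all_of), set(none_of)
    return [d for d in docs if required <= d["tags"] and not (excluded & d["tags"])]


def centroid(docs):
    """Average the term vectors of docs; an empty result set yields an empty centroid."""
    if not docs:
        return {}
    total = Counter()
    for d in docs:
        total.update(d["terms"])  # element-wise sum of term counts
    return {term: count / len(docs) for term, count in total.items()}


def top_terms(cent, k=2):
    """The k highest-weighted centroid terms, ties broken alphabetically."""
    return sorted(cent, key=lambda t: (-cent[t], t))[:k]


politics = select(DOCS, all_of=["politics"])
print(centroid(politics))
print(top_terms(centroid(politics)))
```

Because a centroid is just a per-term sum divided by a document count, the sums and counts for frequently requested tag combinations could in principle be stored ahead of time and combined at query time, which is the general idea behind materializing views in OLAP-style systems.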

Similar resources

An Online Q-learning Based Multi-Agent LFC for a Multi-Area Multi-Source Power System Including Distributed Energy Resources

This paper presents an online two-stage Q-learning based multi-agent (MA) controller for load frequency control (LFC) in an interconnected multi-area multi-source power system integrated with distributed energy resources (DERs). The proposed control strategy consists of two stages. The first stage employs a PID controller whose parameters are designed using sine cosine optimization (SCO...

Visualisation globale de collections de documents sous forme d'hypercube - Le système DocCube

This paper introduces a visual method for exploring a document collection. In this approach, a domain is represented as a set of concept hierarchies that structure the domain knowledge. Users express their points of interest through these hierarchies. One key point of the approach is the use of a data cube representation (as found in OLAP systems). Each concept...

Examination of Vroom’s motivational theory: A new marketing strategy in consumers of online document delivery services: Case study of Shahid Chamran University of Ahvaz

This study aimed to identify and test the expectancy motivational model as a theoretical framework for explaining what motivates information consumers to select and use the document delivery services of Shahid Chamran University, Ahvaz. An explanatory survey method was used. To test the hypotheses and analyze the model's data, covariance structural...

Generating Image Captions using Topic Focused Multi-document Summarization

In the near future digital cameras will come standardly equipped with GPS and compass and will automatically add global position and direction information to the metadata of every picture taken. Can we use this information, together with information from geographical information systems and the Web more generally, to caption images automatically? This challenge is being pursued in the TRIPOD pr...

Interactive Demo: Stay in Touch with InfoVis – Visualizing Document Collections with Document Cards

Large document collections are essential resources for a wide variety of professionals, like scientists, lawyers, analysts, etc. An electronic document management system can assist them in solving the tedious tasks of curating, browsing, searching, and recognizing documents in these collections. As an initial step in creating such a system, we invented the Document Cards [3] as a mixed image-te...

Publication date: 2015